Introduction

I chose to do my final project on the Titanic dataset. I chose this particular dataset because of its popularity among new data scientists: it is one of the easiest datasets to start with when learning to build a predictive model. I wanted to get familiar with it so that I can build such a model myself in the near future, and I also thought it would be a fun topic for the project.

The dataset can be found at:

https://www.kaggle.com/c/titanic and https://raw.githubusercontent.com/rashida048/Datasets/master/titanic_data.csv

Kaggle has an ongoing competition to see who can build the predictive model with the best accuracy on this dataset.

My question is: Can you predict the likelihood of surviving the Titanic disaster? Which characteristics of a person increase or decrease the chances of survival?

The Data

The Titanic dataset contains information about each passenger of the Titanic and their fate.

PassengerId

Unique ID for each Passenger (dbl)

Survived

Indicates if passenger survived (dbl): 0 = died, 1 = survived

Pclass

Ticket Class (dbl): 1 = 1st, 2 = 2nd, 3 = 3rd

Name

Name of passenger (chr)

Sex

Sex of passenger (chr)

Age

Age of passenger (dbl)

SibSp

Number of siblings/spouses on board (dbl): Sibling = Brother, Sister, Stepbrother, Stepsister; Spouse = Husband, Wife (Mistresses and Fiancés ignored)

Parch

Number of parents/children on board (dbl): Parent = Mother, Father; Child = Daughter, Son, Stepdaughter, Stepson (Children traveling only with a nanny = 0)

Ticket

Ticket Number of passenger (chr)

Fare

Price of passenger ticket (dbl)

Cabin

Cabin Number of passenger (chr)

Embarked

Port of Embarkation (chr): Q = Queenstown, S = Southampton, C = Cherbourg

Clean/Tidy Data

## # A tibble: 6 × 12
##   PassengerId Survived Pclass Name    Sex     Age SibSp Parch Ticket  Fare Cabin
##         <dbl>    <dbl>  <dbl> <chr>   <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
## 1           1        0      3 Braund… male     22     1     0 A/5 2…  7.25 <NA> 
## 2           2        1      1 Cuming… fema…    38     1     0 PC 17… 71.3  C85  
## 3           3        1      3 Heikki… fema…    26     0     0 STON/…  7.92 <NA> 
## 4           4        1      1 Futrel… fema…    35     1     0 113803 53.1  C123 
## 5           5        0      3 Allen,… male     35     0     0 373450  8.05 <NA> 
## 6           6        0      3 Moran,… male     NA     0     0 330877  8.46 <NA> 
## # … with 1 more variable: Embarked <chr>
## # A tibble: 6 × 5
##   Survived Pclass Sex    Embarked    cat  
##   <chr>    <chr>  <chr>  <chr>       <chr>
## 1 Died     Third  Male   Southampton Adult
## 2 Survived First  Female Cherbourg   Adult
## 3 Survived Third  Female Southampton Adult
## 4 Survived First  Female Southampton Adult
## 5 Died     Third  Male   Southampton Adult
## 6 Died     Third  Male   Queenstown  <NA>
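The report does not show the code that produced the second table. As a rough illustration only, the recoding from the raw codes to readable labels could look like the Python sketch below. The lookup tables and the `tidy_row` helper are hypothetical names, and the original analysis appears to have been done in R rather than Python.

```python
# Hypothetical lookup tables; the labels match the second table above,
# but the original recoding code is not shown in this report.
SURVIVED_LABELS = {0: "Died", 1: "Survived"}
CLASS_LABELS = {1: "First", 2: "Second", 3: "Third"}
SEX_LABELS = {"male": "Male", "female": "Female"}
PORT_LABELS = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}


def tidy_row(row):
    """Return one passenger record with coded values replaced by labels."""
    return {
        "Survived": SURVIVED_LABELS[row["Survived"]],
        "Pclass": CLASS_LABELS[row["Pclass"]],
        "Sex": SEX_LABELS[row["Sex"]],
        # A missing or unknown port code stays missing (None).
        "Embarked": PORT_LABELS.get(row["Embarked"]),
    }


# First passenger from the tables above.
first_passenger = {"Survived": 0, "Pclass": 3, "Sex": "male", "Embarked": "S"}
tidy = tidy_row(first_passenger)
print(tidy)
```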

Exploring Relationships

Corrplot

Some correlations to note are:

Class & Fare Price:

Class is categorical and is split into 3 categories (1, 2, 3). Fare is numeric. There is a moderate negative correlation (-0.55) between the two variables. This indicates that as the price of the fare increases, the class number decreases. One would already expect higher fares to be associated with 1st class (1), but it is nice to see that the data supports it as well.

Age & Class:

Age and class have a low negative correlation (-0.37). This indicates that as age increases, the class number decreases (closer to 1st class). A plausible explanation is that older passengers had more money than younger passengers.

The corrplot shows a few relationships when it comes to the ‘Survived’ variable:

Class:

Class has a low negative correlation (-0.36) with survival. Class gets worse (economically) as its number increases (1st class is 1, 2nd is 2, and 3rd is 3), so wealthier people are more likely to be part of first class. ‘Survived’ indicates whether the passenger died (0) or survived (1), meaning a higher value corresponds to survival. Therefore, the negative correlation indicates that survival goes with lower class numbers: wealthier people (people in 1st class) were more likely to survive.

Sex:

Sex has a moderate positive correlation (0.54) with survival. As the sex code increases (man (1) to woman (2)), so does survival (from died to survived). This indicates that women were more likely to survive.
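The coefficients above come from the corrplot. The sketch below assumes they are Pearson correlations, which the report does not state, and uses a hypothetical `pearson` helper written in Python purely for illustration. It is applied to the Pclass and Fare values of the six passengers printed in the first table, so it will not reproduce the full-dataset value of -0.55. It only shows why the sign is negative when a higher class number goes with a lower fare.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Pclass and Fare (as displayed, i.e. rounded) for the six passengers
# shown in the first table; far too few rows for a reliable estimate.
pclass = [3, 1, 3, 1, 3, 3]
fare = [7.25, 71.3, 7.92, 53.1, 8.05, 8.46]

r = pearson(pclass, fare)
print(f"Correlation between Pclass and Fare on six rows: {r:.2f}")
```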

Summary of Passengers

Demographics

Age Categories:
Baby (0-2)
Toddler (2-5)
Child (5-13)
Teen (13-20)
Adult (20-40)
MAA, Middle-Aged Adult (40-60)
Senior (60+)
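The ranges above share their boundary values (2, 5, 13, and so on), and the report does not say which category a boundary age falls into. The sketch below assumes each range includes its lower bound and excludes its upper bound, so a 20-year-old counts as an Adult. The `age_category` function and `AGE_BINS` table are hypothetical names in Python, not the code used for the report.

```python
# Exclusive upper bound and label for each category, youngest first.
# Assumption: ranges are half-open, e.g. Adult covers 20 <= age < 40.
AGE_BINS = [
    (2, "Baby"),
    (5, "Toddler"),
    (13, "Child"),
    (20, "Teen"),
    (40, "Adult"),
    (60, "MAA"),
]


def age_category(age):
    """Map an age in years to its category label; missing ages stay missing."""
    if age is None:
        return None
    for upper, label in AGE_BINS:
        if age < upper:
            return label
    return "Senior"


# Ages of the six passengers in the first table (the last one is missing).
ages = [22, 38, 26, 35, 35, None]
categories = [age_category(a) for a in ages]
print(categories)
```

On those six ages this gives five Adults and one missing value, which matches the `cat` column in the second table.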

This plot shows that mostly Adults (ages 20-40) were on board. Middle-Aged Adults are the second most common age category. There appears to be an upward trend up to Adults and a downward trend after Adults. There are also more male passengers than female passengers in just about every age category.

Port/Origin

This map shows the 3 ports where passengers boarded the Titanic. Southampton is the port with the largest number of passengers. A little over half of the passengers from Southampton were third class; the rest were split somewhat evenly between first and second class. The second largest port is Cherbourg, which had over half of its passengers in first class. The smallest port is Queenstown, which had 77 passengers, 72 of them third class.

Class/Fare

The plot shows the number of passengers, as well as the fare price, for each class. Third class is the largest, first class is in the middle, and second class is the smallest. As expected, the price goes up as you get closer to first class.

Analysis

Demographics

This plot shows the survival rate of passengers by their age category and sex. Females survived at a significantly higher rate than males. For women, the survival rate for babies and seniors appears to be 100%, and all other categories seem to be similar to one another. For men, the chances of survival decrease as age increases.

Port/Origin

Southampton had the largest fatality rate, at 66.3% of its passengers. Queenstown was a close second at 61%. Cherbourg was the only port below 50%, at 44.6%.

Class/Port

This graph shows the death rate of individuals based on their port and class. The common theme is that passengers were more likely to die the closer they were to third class. Southampton appears to have a larger death rate in all categories. Maybe passengers from Southampton were put in a similar (unlucky) part of the ship?

Alone/Family

This graph shows the survival rate of individuals based on whether or not they had family on board. Overall, 51.6% of people with family on board survived, compared with 33.2% of people traveling alone. Men who were alone had a 16.8% survival rate, while men with family had a 28.2% survival rate. Women traveling alone had a 79% survival rate, and women with family had a 73.3% survival rate. This indicates that men traveling alone were the least likely to survive.
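The report does not say how "alone" was derived. One common choice, assumed here, is to treat a passenger as alone when both SibSp and Parch are 0. The Python sketch below uses a hypothetical `survival_rates` helper to compute survival rates by sex and travel group, run on the six passengers from the first table, so the resulting rates are only illustrative and are not the percentages quoted above.

```python
from collections import defaultdict


def survival_rates(passengers):
    """Return {(sex, travel_group): share of survivors} for passenger dicts."""
    counts = defaultdict(lambda: [0, 0])  # key -> [survivors, total]
    for p in passengers:
        # Assumption: "Alone" means no siblings/spouses and no
        # parents/children on board.
        group = "Alone" if p["SibSp"] + p["Parch"] == 0 else "Family"
        key = (p["Sex"], group)
        counts[key][0] += p["Survived"]
        counts[key][1] += 1
    return {key: survivors / total for key, (survivors, total) in counts.items()}


# The six passengers shown in the first table.
sample = [
    {"Sex": "male", "SibSp": 1, "Parch": 0, "Survived": 0},
    {"Sex": "female", "SibSp": 1, "Parch": 0, "Survived": 1},
    {"Sex": "female", "SibSp": 0, "Parch": 0, "Survived": 1},
    {"Sex": "female", "SibSp": 1, "Parch": 0, "Survived": 1},
    {"Sex": "male", "SibSp": 0, "Parch": 0, "Survived": 0},
    {"Sex": "male", "SibSp": 0, "Parch": 0, "Survived": 0},
]

rates = survival_rates(sample)
for key, rate in sorted(rates.items()):
    print(key, f"{rate:.0%}")
```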

Gender, Class, and Age

Histogram

This graph shows groups of people by their age, sex, and class, which have been identified as the most important variables. Once again, men have lower survival rates than women. You can also see that the groups shift toward first class as the survival rate increases.

Scatter Plot

You can see that most people in the ‘All Survived’ category are young and upper-class. ‘Mostly Survived’ shows mostly upper-class women. ‘Mostly Died’ shows mostly lower-class men. ‘All Died’ appears to consist of outliers.

All Survived

## # A tibble: 13 × 5
## # Groups:   cat, Sex, Pclass [13]
##    cat     Sex    Pclass group1        CSCT
##    <chr>   <chr>  <chr>  <chr>        <int>
##  1 Senior  Female First  All Survived     3
##  2 Teen    Female First  All Survived    13
##  3 Child   Female Second All Survived     4
##  4 Teen    Female Second All Survived     8
##  5 Toddler Female Second All Survived     4
##  6 Baby    Female Third  All Survived     4
##  7 Senior  Female Third  All Survived     1
##  8 Baby    Male   First  All Survived     1
##  9 Child   Male   First  All Survived     1
## 10 Toddler Male   First  All Survived     1
## 11 Baby    Male   Second All Survived     5
## 12 Child   Male   Second All Survived     1
## 13 Toddler Male   Second All Survived     3

Mostly Survived

## # A tibble: 8 × 5
## # Groups:   cat, Sex, Pclass [8]
##   cat     Sex    Pclass group1           CSCT
##   <chr>   <chr>  <chr>  <chr>           <int>
## 1 Adult   Female First  Mostly Survived    43
## 2 MAA     Female First  Mostly Survived    25
## 3 Adult   Female Second Mostly Survived    42
## 4 MAA     Female Second Mostly Survived    16
## 5 Teen    Female Third  Mostly Survived    22
## 6 Toddler Female Third  Mostly Survived     8
## 7 Adult   Male   First  Mostly Survived    41
## 8 Baby    Male   Third  Mostly Survived     4

Mostly Died

## # A tibble: 14 × 5
## # Groups:   cat, Sex, Pclass [14]
##    cat     Sex    Pclass group1       CSCT
##    <chr>   <chr>  <chr>  <chr>       <int>
##  1 Adult   Female Third  Mostly Died    47
##  2 Child   Female Third  Mostly Died    11
##  3 MAA     Male   First  Mostly Died    39
##  4 Senior  Male   First  Mostly Died    14
##  5 Teen    Male   First  Mostly Died     4
##  6 Adult   Male   Second Mostly Died    59
##  7 MAA     Male   Second Mostly Died    17
##  8 Senior  Male   Second Mostly Died     4
##  9 Teen    Male   Second Mostly Died    10
## 10 Adult   Male   Third  Mostly Died   155
## 11 Child   Male   Third  Mostly Died    12
## 12 MAA     Male   Third  Mostly Died    31
## 13 Teen    Male   Third  Mostly Died    38
## 14 Toddler Male   Third  Mostly Died     9

All Died

## # A tibble: 3 × 5
## # Groups:   cat, Sex, Pclass [3]
##   cat     Sex    Pclass group1    CSCT
##   <chr>   <chr>  <chr>  <chr>    <int>
## 1 Toddler Female First  All Died     1
## 2 MAA     Female Third  All Died     9
## 3 Senior  Male   Third  All Died     4
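The thresholds behind the four groups above are not given in the report. The sketch below is one plausible rule, under the assumption that a group counts as "Mostly Survived" when at least half of its members survived and as "Mostly Died" otherwise. The `outcome_group` function is a hypothetical Python helper, and the counts in the usage lines are made up for illustration.

```python
def outcome_group(survivors, total):
    """Label an age/sex/class group by the share of its members who survived."""
    if total <= 0:
        raise ValueError("a group needs at least one passenger")
    rate = survivors / total
    if rate == 1:
        return "All Survived"
    if rate == 0:
        return "All Died"
    # Assumption: a group at exactly 50% is counted as "Mostly Survived".
    return "Mostly Survived" if rate >= 0.5 else "Mostly Died"


# Made-up counts, only to show each branch of the rule.
print(outcome_group(4, 4))
print(outcome_group(30, 40))
print(outcome_group(5, 40))
print(outcome_group(0, 4))
```

Note that with this rule a single passenger is enough to put a group into "All Survived" or "All Died", which is consistent with the very small groups in those two tables.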

Conclusion

In conclusion, sex appears to be the most important variable when it comes to surviving. In almost every category (class and age), women survived at a higher rate than men. Second is age. For men, the relationship with age is almost perfectly linear: as the age of men increases, survival rates decrease. For women, age is not as important. Outside of baby girls and elderly women, age does not play a large role. Class has a low negative correlation (-0.36) with survival: the closer to first class, the better the chances of survival. Port of embarkation also appears to be significant, but this is likely due to differences in class, age, and sex between the ports. However, if port of embarkation affected where a passenger was placed on the ship, it could be a very important variable. Lastly, having family on the ship mattered, at least for men: men with family on board survived at a higher rate than men traveling alone. I assume that it would be difficult to get onto a lifeboat with nobody there to help you secure a spot.